This is a course on Open Data Science.
More information can be found [here](https://studies.helsinki.fi/courses/course-implementation/hy-opt-cur-2324-6bc9901d-e80d-4ba7-8bce-829e42c15521/PHD-302).
date()
## [1] "Mon Dec 11 19:44:43 2023"
So far I have only completed the reading and exercises from chapters 1 and 2, and I’m panicking a little. It was difficult for me to understand how the course starts, and I was confused about what to complete. Also, I didn’t remember how much work a 5-credit course can be :) I expect to learn how to use GitHub and understand R code. I heard about the course from a university email.
The crash course is excellent, easy to follow and understand, but I should have had more time to go through it, because everything was new to me. So far I have only completed chapters 1 and 2.
My GitHub repository is here.
This week I completed Excercise2.Rmd, the data wrangling exercise and the data analysis, where I learned some data wrangling: modifying data sets, extracting and combining desired variables into a new data set, checking the data, and doing some statistical tests and visualizations. I also spent a lot of time learning new words and concepts to better understand these statistical analyses.
date()
## [1] "Mon Dec 11 19:44:43 2023"
students2014 <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt")
dim(students2014)
## [1] 166 7
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
# students2014 is available
# Access the ggplot2 library
library(ggplot2)
# Summary of numeric variables
summary(students2014)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
# initialize plot with data and aesthetic mapping
p1 <- ggplot(students2014, aes(x = attitude, y = points))
# define the visualization type (points)
p2 <- p1 + geom_point()
# draw the plot
p2
# add a regression line
p3 <- p2 + geom_smooth(method = "lm")
# add a main title
p4 <- p3 + ggtitle("Student's attitude versus exam points")
# draw the plot!
p4
## `geom_smooth()` using formula = 'y ~ x'
# students2014 is available
# Access the ggplot2 library
library(ggplot2)
model <- lm(points ~ age + attitude + deep, data = students2014)
# Summary of the fitted model
summary(model)
##
## Call:
## lm(formula = points ~ age + attitude + deep, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.206 -3.430 0.204 3.979 10.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.60773 3.38966 4.605 8.32e-06 ***
## age -0.07716 0.05322 -1.450 0.149
## attitude 3.59413 0.56959 6.310 2.56e-09 ***
## deep -0.60275 0.75031 -0.803 0.423
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.307 on 162 degrees of freedom
## Multiple R-squared: 0.2043, Adjusted R-squared: 0.1896
## F-statistic: 13.87 on 3 and 162 DF, p-value: 4.305e-08
model2 <- lm(points ~ attitude, data = students2014)
# Summary of the fitted model2
summary(model2)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
# students2014 and model2 are available
# Place the following diagnostic plots into one figure with a 2x2 plot layout
par(mfrow = c(2,2))
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
plot(model2, which = c(1, 2, 5))
Residuals vs Fitted: if the linearity assumption holds, the residuals should be scattered randomly around zero with no clear pattern against the fitted values. A curved or funnel-shaped pattern would indicate that the linearity (or constant variance) assumption is broken. –> Conclusions: The points are scattered and no specific pattern is shown, so the linearity assumption is valid.
Normal QQ-plot: the normal QQ-plot is used to check whether the residuals follow a normal distribution. If they do, the points should form a straight line; if the points deviate from it, the residuals are less likely to be normally distributed. –> Conclusions: The points follow the diagonal line quite well, with some deviations in the “tails”. Still, it is reasonable to assume that the errors are normally distributed.
Residuals vs Leverage: this plot shows each observation as a single point: the x-axis shows the leverage of each point and the y-axis its standardized residual. Leverage measures how much the regression coefficients would change if a particular observation were removed from the data. The standardized residual is the standardized difference between the predicted value of an observation and its actual value. –> Conclusions: The residuals are spread relatively evenly and no influential points can be observed.
Done!
This week the aim is to statistically investigate a dataset related to student performance and alcohol consumption in secondary education at two Portuguese schools.
date()
## [1] "Mon Dec 11 19:44:45 2023"
The data set is available here.
# Read the data table into R from the linked file and save it as "alc", using comma as the column separator. The first row contains the column names.
alc <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", sep=",", header=TRUE)
# Load required libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# print out the names of the variables in the data
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
This dataset was downloaded from the UCI Machine Learning Repository and relates to student performance in secondary education at two Portuguese schools. The data originates from school reports and questionnaires and contains variables such as age, gender, family background, study time, and various academic performance metrics (e.g., grades). The dataset also contains information about the students’ alcohol consumption during weekdays and weekends.
Let’s choose the following variables to study their relationship with alcohol consumption: study time, sex, and age.
First let’s look at the distribution of alcohol consumption (‘Walc’ for weekends and ‘Dalc’ for weekdays) and its relationship with the variable ‘studytime’.
# Explore the dataset structure
glimpse(alc)
## Rows: 370
## Columns: 35
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
# data has 370 rows and 35 columns
# Summary statistics of relevant variables
summary(alc$Walc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.295 3.000 5.000
summary(alc$Dalc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.484 2.000 5.000
summary(alc$studytime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.043 2.000 4.000
# Cross-tabulation of studytime and Walc
cross_tab_walc_studytime <- table(alc$studytime, alc$Walc)
print(cross_tab_walc_studytime)
##
## 1 2 3 4 5
## 1 22 24 19 20 13
## 2 72 38 40 24 11
## 3 30 14 13 2 1
## 4 17 4 3 1 2
Because the mean value of Walc is slightly higher than that of Dalc, let’s focus on weekend alcohol consumption.
# Bar plot of average Walc for each level of studytime
alc_studytime_bar_plot <- ggplot(alc, aes(x = as.factor(studytime), y = Walc)) +
geom_bar(stat = "summary", fun = "mean", fill = "skyblue") +
labs(title = "Average Weekend Alcohol Consumption by Study Time",
x = "Study Time",
y = "Average Weekend Alcohol Consumption")
print(alc_studytime_bar_plot)
# Box plot of Walc by studytime
alc_studytime_box_plot <- ggplot(alc, aes(x = as.factor(studytime), y = Walc)) +
geom_boxplot(fill = "skyblue") +
labs(title = "Weekend Alcohol Consumption by Study Time",
x = "Study Time",
y = "Weekend Alcohol Consumption")
print(alc_studytime_box_plot)
Alcohol consumption varies across the different levels of study time, and it seems that students with higher study times have lower average alcohol consumption.
Then let’s look at the distribution of alcohol consumption (‘Walc’ for weekends and ‘Dalc’ for weekdays) and its relationship with the variable ‘sex’.
# Summary statistics of relevant variables
summary(alc$sex)
## Length Class Mode
## 370 character character
# Cross-tabulation of sex and Walc
cross_tab_sex <- table(alc$sex, alc$Walc)
print(cross_tab_sex)
##
## 1 2 3 4 5
## F 87 48 43 13 4
## M 54 32 32 34 23
# Bar plot of average Walc for each level of sex
alc_sex_bar_plot <- ggplot(alc, aes(x = as.factor(sex), y = Walc)) +
geom_bar(stat = "summary", fun = "mean", fill = "skyblue") +
labs(title = "Average Weekend Alcohol Consumption by sex",
x = "sex",
y = "Average Weekend Alcohol Consumption")
print(alc_sex_bar_plot)
# Box plot of Walc by sex
alc_sex_box_plot <- ggplot(alc, aes(x = as.factor(sex), y = Walc)) +
geom_boxplot(fill = "skyblue") +
labs(title = "Weekend Alcohol Consumption by sex",
x = "sex",
y = "Weekend Alcohol Consumption")
print(alc_sex_box_plot)
Average weekend alcohol consumption seems to be slightly higher for males than for females.
Lastly, let’s look at the distribution of alcohol consumption (‘Walc’ for weekends) and its relationship with the variable ‘age’.
# Summary statistics of age
summary(alc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 16.00 17.00 16.58 17.00 22.00
# Cross-tabulation of age and Walc
cross_tab_age <- table(alc$age, alc$Walc)
print(cross_tab_age)
##
## 1 2 3 4 5
## 15 46 12 13 7 3
## 16 41 23 15 15 8
## 17 25 23 27 14 8
## 18 25 19 15 11 7
## 19 3 3 5 0 0
## 20 1 0 0 0 0
## 22 0 0 0 0 1
# Bar plot of average Walc for each level of age
alc_age_bar_plot <- ggplot(alc, aes(x = as.factor(age), y = Walc)) +
geom_bar(stat = "summary", fun = "mean", fill = "skyblue") +
labs(title = "Average Weekend Alcohol Consumption by age",
x = "age",
y = "Average Weekend Alcohol Consumption")
print(alc_age_bar_plot)
# Box plot of Walc by age
alc_age_box_plot <- ggplot(alc, aes(x = as.factor(age), y = Walc)) +
geom_boxplot(fill = "skyblue") +
labs(title = "Weekend Alcohol Consumption by age",
x = "age",
y = "Weekend Alcohol Consumption")
print(alc_age_box_plot)
Average weekend alcohol consumption seems to increase slightly from the 15- to the 17-year-old students, decrease until age 20, and then jump up again for the 22-year-olds. Inspecting the box plot, the data has some outliers in the age groups 15, 17, 20 and 22 that can influence the overall patterns observed in the relationship between age and alcohol consumption. To investigate this, let’s calculate the correlation coefficient between age and alcohol consumption.
# Calculate the correlation coefficient between age and Walc
correlation_walc_age <- cor(alc$age, alc$Walc)
# Print the correlation coefficient
print(correlation_walc_age)
## [1] 0.1551427
The correlation coefficient is a value between -1 and 1. A positive value indicates a positive correlation and a negative value a negative correlation. A value of exactly 1 or -1 indicates a perfect correlation, and 0 indicates no correlation.
The correlation coefficient between age and Walc is 0.1551427. This value is close to 0 but still positive, so age has a weak positive correlation with alcohol consumption during the weekend.
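As a quick sanity check of these bounds, a minimal illustration with toy vectors (not from the data):

```r
# Perfectly linearly related vectors sit at the bounds of cor()
x <- c(1, 2, 3, 4)
cor(x,  2 * x)           # 1: perfect positive correlation
cor(x, -2 * x)           # -1: perfect negative correlation
cor(x, c(1, -1, -1, 1))  # 0: no linear relationship
```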
Let’s fit a logistic regression model with high_use as the binary outcome variable, and study time, sex, and age as the predictor variables. The family = “binomial” argument specifies logistic regression.
# Fit logistic regression model
logistic_model <- glm(high_use ~ studytime + sex + age, data = alc, family = "binomial")
# Summarize the fitted model
summary(logistic_model)
##
## Call:
## glm(formula = high_use ~ studytime + sex + age, family = "binomial",
## data = alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.6554 1.7142 -2.132 0.03297 *
## studytime -0.4972 0.1601 -3.105 0.00190 **
## sexM 0.7072 0.2474 2.858 0.00426 **
## age 0.2054 0.1008 2.037 0.04161 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 421.24 on 366 degrees of freedom
## AIC: 429.24
##
## Number of Fisher Scoring iterations: 4
Based on this logistic regression model it seems that each predictor variable is associated with the “high_use” of alcohol. The p-values suggest that study time (p-value of 0.00190), sex (p-value of 0.00426), and age (p-value of 0.04161) are statistically significant predictors of alcohol consumption.
The coefficients for sexM and age are positive: as age increases, the log odds of being in the “high_use” category (high alcohol consumption) increase, and males have higher odds of high alcohol consumption than females. The coefficient for studytime is negative, indicating that as study time increases, the log odds of being in the “high_use” category decrease.
# Extract odds ratios and confidence intervals
odds_ratios <- exp(coef(logistic_model))
conf_intervals <- confint(logistic_model)
## Waiting for profiling to be done...
# Print odds ratios and confidence intervals
print("Odds Ratios:")
## [1] "Odds Ratios:"
print(odds_ratios)
## (Intercept) studytime sexM age
## 0.02585034 0.60825171 2.02832763 1.22807258
print("Confidence Intervals:")
## [1] "Confidence Intervals:"
print(conf_intervals)
## 2.5 % 97.5 %
## (Intercept) -7.063114273 -0.3259619
## studytime -0.821067712 -0.1918567
## sexM 0.224462537 1.1962918
## age 0.009734844 0.4061281
Odds ratio for study time is 0.60825171 –> So that means that for each one-unit increase in study time, the odds of being in the “high_use” category decreases by approximately 39.2% (1 - 0.6082).
Odds ratio for sex (sexM) is 2.02832763 –> meaning that males have approximately 2.03 times the odds of being in the “high_use” category compared to females.
Odds ratio for age is 1.22807258 –> For each one-unit increase in age, the odds of being in the “high_use” category increase by approximately 22.8% (1.2281 - 1).
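The percentage interpretations above follow directly from the odds ratios; as a small sketch, the conversion is simply (OR - 1) * 100 (or_to_pct is a hypothetical helper name, not from the analysis above):

```r
# Convert an odds ratio to a percentage change in the odds
# (hypothetical helper, for illustration only)
or_to_pct <- function(or) (or - 1) * 100
or_to_pct(0.6083)  # about -39.2: odds decrease ~39.2% per unit of study time
or_to_pct(1.2281)  # about 22.8: odds increase ~22.8% per year of age
```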
If a confidence interval includes 0 for a coefficient (equivalently, 1 for the corresponding odds ratio), the parameter is not statistically significant. In this data set, the coefficient confidence intervals for study time (-0.8211, -0.1919), sex (0.2245, 1.1963) and age (0.0097, 0.4061) do not include 0 –> we can be 95% confident that these variables have a true effect on being in the “high_use” category, and thus they are statistically significant.
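Note that confint() returns intervals on the log-odds (coefficient) scale, where significance corresponds to excluding 0. To assess the odds ratios against 1 instead, the intervals can be exponentiated; a sketch using the printed coefficient intervals:

```r
# Exponentiating the coefficient-scale intervals gives intervals on the
# odds-ratio scale, where significance means the interval excludes 1
exp(c(-0.8211, -0.1919))  # studytime: roughly (0.440, 0.825) -- excludes 1
exp(c( 0.2245,  1.1963))  # sexM:      roughly (1.252, 3.308) -- excludes 1
exp(c( 0.0097,  0.4061))  # age:       roughly (1.010, 1.501) -- excludes 1
# For the fitted model directly: exp(confint(logistic_model))
```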
# Make predictions using the logistic regression model
predictions <- predict(logistic_model, type = "response")
# Convert predicted probabilities to binary predictions (0 or 1)
predicted_classes <- ifelse(predictions > 0.5, 1, 0)
# Create a data frame with actual and predicted values
prediction_df <- data.frame(Actual = alc$high_use, Predicted = predicted_classes)
# 2x2 Cross-tabulation
confusion_matrix <- table(prediction_df$Actual, prediction_df$Predicted)
print("Confusion Matrix:")
## [1] "Confusion Matrix:"
print(confusion_matrix)
##
## 0 1
## FALSE 244 15
## TRUE 92 19
# Calculate training error (proportion of inaccurately classified individuals)
training_error <- mean(prediction_df$Actual != prediction_df$Predicted)
print(paste("Training Error:", training_error))
## [1] "Training Error: 0.289189189189189"
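Beyond the overall training error, the confusion matrix can be broken down into per-class metrics. A small sketch using the counts printed above (TN = 244, FP = 15, FN = 92, TP = 19; the metric names are my additions):

```r
# Per-class metrics computed from the printed 2x2 confusion matrix
TN <- 244; FP <- 15; FN <- 92; TP <- 19
accuracy    <- (TN + TP) / (TN + FP + FN + TP)  # 263/370, about 0.711
sensitivity <- TP / (TP + FN)                   # 19/111, about 0.171
specificity <- TN / (TN + FP)                   # 244/259, about 0.942
```

The very low sensitivity shows the model detects few of the actual high-use students and mostly predicts the majority class; note that 1 - accuracy reproduces the training error of about 0.289.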
# Display a graphic visualizing actual values and predictions
library(ggplot2)
# Scatter plot of age versus studytime
age_studytime_scatter_plot <- ggplot(prediction_df, aes(x = alc$age, y = alc$studytime, color = factor(Predicted))) +
geom_point() +
labs(title = "Scatter Plot of Age vs. Study Time with Predictions",
x = "Age",
y = "Study Time",
color = "Predicted") +
theme_minimal()
print(age_studytime_scatter_plot)
# Remove rows with missing values in the 'high_use' column
alc_no_missing <- alc[complete.cases(alc$high_use), ]
# Compare with a simple guessing strategy: always predict the majority class.
# high_use is logical, so pick the more common of TRUE/FALSE directly
# (coercing the table names with as.numeric() would only produce NAs)
majority_class <- names(sort(table(alc_no_missing$high_use), decreasing = TRUE))[1] == "TRUE"
simple_guessing <- rep(majority_class, nrow(alc_no_missing))
# Calculate training error for the simple guessing strategy
simple_guessing_error <- mean(alc_no_missing$high_use != simple_guessing)
print(paste("Simple Guessing Error:", simple_guessing_error))
## [1] "Simple Guessing Error: 0.3"
library(boot)
# Define the logistic regression model
logistic_model <- glm(high_use ~ studytime + sex + age, data = alc, family = "binomial")
# cv.glm() performs the K-fold cross-validation internally, refitting the
# model and predicting on each held-out fold, so no manual fold loop or
# prediction function is needed
# Perform 10-fold cross-validation
cv_results <- cv.glm(data = alc, glmfit = logistic_model, K = 10)
# cv_results$delta holds the raw and adjusted cross-validation estimates of
# prediction error (with the default cost: mean squared error between the
# observed response and the predicted probability); average the two
cv_error <- mean(cv_results$delta)
# Print the cross-validation error
print(paste("10-Fold Cross-Validation Error:", cv_error))
## [1] "10-Fold Cross-Validation Error: 0.19665616037216"
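Note that cv.glm()'s default cost is the average squared error between the observed response and the predicted probability, not the share of misclassified students. To cross-validate the classification error itself, a cost function can be supplied; a sketch reusing alc and logistic_model from above (the cost function is my addition, following the cv.glm() documentation, not part of the original analysis):

```r
# Cost function: proportion of cases whose predicted probability falls on
# the wrong side of 0.5, i.e. the misclassification rate
cost <- function(class, prob) mean(abs(class - prob) > 0.5)
cv_class <- cv.glm(data = alc, glmfit = logistic_model, cost = cost, K = 10)
cv_class$delta[1]  # 10-fold cross-validated misclassification rate
```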
date()
## [1] "Mon Dec 11 19:44:45 2023"
The Boston data set is available in R and can be loaded from the MASS package. It contains information on housing values in the suburbs of Boston. Each row represents a town, and each column is a variable describing some aspect of the towns.
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# explore the structure of the data set
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# explore the dimensions of the data set
dim(Boston)
## [1] 506 14
The Boston data set has 506 observations (rows) and 14 variables (columns); most variables are numeric, except chas and rad, which are integers.
The following gives a graphical overview and summaries of the variables in the data. The distributions of some variables and the relationships between them are also discussed.
# Look at the structure of the data set
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# plot matrix of the variables
pairs(Boston)
# Access tidyr and corrplot libraries
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# calculate the correlation matrix and round it
cor_matrix <- Boston %>%
cor() %>%
round(2)
# print the correlation matrix
print(cor_matrix)
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
The correlation matrix shows the pairwise correlations between the variables, indicating the strength and direction of their linear relationships. Positive values indicate a positive linear relationship, negative values a negative one, and values close to zero a weak or nonexistent linear relationship. Since eyeballing the correlation matrix can be quite challenging with bigger data sets, and also with this one, I use the corrplot visualization to interpret the correlations. Strong negative or positive correlations (closer to -1 or 1) have a darker color (red close to -1 and blue close to 1). The size of the circle also indicates the magnitude of the correlation.
Some of the variables are highly correlated, either negatively or positively. A strongly negatively correlated pair is, e.g., lower status of the population (lstat) and median value of owner-occupied homes in $1000s (medv). A strongly positively correlated pair is, e.g., full-value property-tax rate per $10,000 (tax) and proportion of non-retail business acres per town (indus).
Because the variables are on very different scales (and linear discriminant analysis assumes comparable variances across variables), they need to be scaled.
The dataset is standardized as follows:
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
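As a reminder of what scale() actually computes: it subtracts each column's mean and divides by its standard deviation. A minimal sketch with a toy vector (not the Boston data):

```r
# scale() is equivalent to (x - mean(x)) / sd(x), column by column
x <- c(1, 2, 3, 4)
all.equal(as.vector(scale(x)), (x - mean(x)) / sd(x))  # TRUE
```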
Then I’ll create a categorical variable from crim (per capita crime rate by town) to be our factor variable, cutting it by quantiles so that the low, middle and high crime rates fall into their own categories. After this I’ll remove the original crim variable from the data set and substitute the new one.
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Now the data will be divided into train and test sets, with 80% of the data going to the train set. We also save the correct classes from the test set and remove the crime variable from it.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
The categorical crime rate is used as the target variable in the linear discriminant analysis. All the other variables in the data set are used as predictor variables.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2425743 0.2599010 0.2549505 0.2425743
##
## Group means:
## zn indus chas nox rm age
## low 0.9130483 -0.8950041 -0.151805591 -0.8615010 0.4291075 -0.8793412
## med_low -0.1005293 -0.2982934 -0.009855719 -0.5548611 -0.1619742 -0.3170439
## med_high -0.3773415 0.1638680 0.301035044 0.3483259 0.2413675 0.4155388
## high -0.4872402 1.0171960 -0.031282114 1.1029298 -0.3896971 0.8153299
## dis rad tax ptratio black lstat
## low 0.8270651 -0.6888949 -0.7226799 -0.37632230 0.3710387 -0.73771431
## med_low 0.3471767 -0.5476413 -0.4618963 -0.06568307 0.3164061 -0.12329985
## med_high -0.3511543 -0.3753022 -0.2915106 -0.30324295 0.1370716 -0.02821847
## high -0.8602633 1.6373367 1.5134896 0.77985517 -0.7695333 0.91500117
## medv
## low 0.490266324
## med_low -0.007709118
## med_high 0.273324689
## high -0.688898102
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.074552946 0.84966455 -0.91589354
## indus 0.018170689 -0.34323897 0.24092745
## chas -0.080851312 -0.06197874 0.04358941
## nox 0.468065359 -0.68162882 -1.38284512
## rm -0.073408090 -0.17123796 -0.26652658
## age 0.298177360 -0.37306911 -0.05579778
## dis 0.021937737 -0.51239207 0.13218308
## rad 2.973023616 0.76586908 -0.01464469
## tax -0.009623886 0.08188377 0.55558121
## ptratio 0.107331709 0.09803687 -0.32259536
## black -0.133389954 -0.01831856 0.11223747
## lstat 0.244326768 -0.30774150 0.33087817
## medv 0.237565655 -0.48836623 -0.17810951
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9437 0.0419 0.0144
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen=2, col=classes)
Next I’d save the crime categories from the test set and remove the categorical crime variable from the test dataset (as was already done above). Then I’d predict the crime rate in the test set using the model from the train set and cross-tabulate the predictions with the correct classes. However, I got an error from this code and was not able to fix it in time.
Code to save the correct classes from test data: correct_classes <- test$crime
Code to remove the crime variable from test data: test <- dplyr::select(test, -crime)
Code to predict classes with test data: lda.pred <- predict(lda.fit, newdata = test)
Code to cross tabulate the results: table(correct = correct_classes, predicted = lda.pred$class)
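For the record, the predict-and-cross-tabulate pattern itself does work; below is a self-contained sketch of the same steps on R’s built-in iris data (MASS ships with R, so this runs as-is). One guess at the error above: crime was already removed from test in an earlier chunk, so removing it a second time with dplyr::select(test, -crime) would itself fail.

```r
# self-contained sketch of the LDA predict + cross-tabulate pattern on iris
library(MASS)

set.seed(42)
ind <- sample(nrow(iris), size = 0.8 * nrow(iris))
train <- iris[ind, ]
test  <- iris[-ind, ]

correct_classes <- test$Species   # save the correct classes
test$Species <- NULL              # remove the target from the test data (only once!)

fit  <- lda(Species ~ ., data = train)
pred <- predict(fit, newdata = test)

# cross-tabulate correct vs predicted classes
table(correct = correct_classes, predicted = pred$class)
```

With the Boston model the same two lines, predict(lda.fit, newdata = test) and the table() call, should work, provided the removal of crime is run only once.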
# reload and re-scale the Boston data for clustering (Boston comes from MASS)
library(MASS)
data("Boston")
boston_scaled <- scale(Boston)
# Euclidean distance matrix between the scaled observations
boston_euk_dist <- dist(boston_scaled)
Let’s run the k-means clustering with 3 clusters and visualise the result with a few relevant variables.
# k-means clustering
km <- kmeans(boston_scaled, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled[,6:10], col = km$cluster)
That doesn’t look very good, so let’s try 2 clusters instead of 3.
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled[,6:10], col = km$cluster)
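Rather than trying cluster counts by hand, a common heuristic is to compute the total within-cluster sum of squares (WCSS) for a range of k and look for the “elbow” where it stops dropping sharply. A minimal sketch on toy data (the variable names here are my own):

```r
# "elbow" heuristic: total within-cluster sum of squares (WCSS) for k = 1..6
set.seed(123)
x <- scale(matrix(rnorm(200), ncol = 2))  # toy standardized data, 100 rows

twcss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

# a sharp drop followed by a flat tail suggests a reasonable value of k
plot(1:6, twcss, type = "b",
     xlab = "number of clusters k", ylab = "total WCSS")
```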
date()
## [1] "Mon Dec 11 19:44:47 2023"
# Load required libraries
library(dplyr)
library(readr)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# Specify the file path into CSV file
file_path <- "/Users/minmaunu/IODS-project/data/human.csv"
# Read the CSV file into a data frame
human <- read_csv(file_path)
## Rows: 148 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu_Ratio, LF_Ratio, Edu.Exp, Life.Exp, GNI, Mat.Mor, Ado.Birth, Pa...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Explore the datasets
# See the structure of the data.
str(human)
## spc_tbl_ [148 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Country : chr [1:148] "Norway" "Australia" "Switzerland" "Denmark" ...
## $ Edu_Ratio: num [1:148] 1.007 0.997 0.983 0.989 0.969 ...
## $ LF_Ratio : num [1:148] 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num [1:148] 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num [1:148] 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : num [1:148] 64992 42261 56431 44025 45435 ...
## $ Mat.Mor : num [1:148] 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num [1:148] 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num [1:148] 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
## - attr(*, "spec")=
## .. cols(
## .. Country = col_character(),
## .. Edu_Ratio = col_double(),
## .. LF_Ratio = col_double(),
## .. Edu.Exp = col_double(),
## .. Life.Exp = col_double(),
## .. GNI = col_double(),
## .. Mat.Mor = col_double(),
## .. Ado.Birth = col_double(),
## .. Parli.F = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# See the dimension of the data.
dim(human)
## [1] 148 9
# move the country names to rownames
library(tibble)
human_ <- column_to_rownames(human, "Country")
# Access GGally
library(GGally)
# visualize the 'human_' variables
ggpairs(human_, progress = FALSE)
# Check the summary
summary(human_)
## Edu_Ratio LF_Ratio Edu.Exp Life.Exp
## Min. :0.1980 Min. :0.1857 Min. : 7.00 Min. :49.00
## 1st Qu.:0.8097 1st Qu.:0.5971 1st Qu.:11.57 1st Qu.:68.38
## Median :0.9444 Median :0.7504 Median :13.55 Median :74.35
## Mean :0.8766 Mean :0.7006 Mean :13.42 Mean :72.44
## 3rd Qu.:0.9990 3rd Qu.:0.8385 3rd Qu.:15.32 3rd Qu.:77.45
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 680 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 5190 1st Qu.: 11.0 1st Qu.: 12.18 1st Qu.:12.15
## Median : 12408 Median : 45.5 Median : 31.30 Median :19.50
## Mean : 18402 Mean :120.9 Mean : 43.72 Mean :20.95
## 3rd Qu.: 25350 3rd Qu.:170.0 3rd Qu.: 69.05 3rd Qu.:27.82
## Max. :123124 Max. :730.0 Max. :175.60 Max. :57.50
Most variables seem to be quite normally distributed, except GNI, maternal mortality and adolescent birth rate, which are skewed. Many of the variables correlate with each other: for example, expected education and life expectancy have a strong positive correlation, whereas maternal mortality and life expectancy have a strong negative correlation.
# perform PCA on the unstandardized data
pca_human <- prcomp(human_)
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("darkblue", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
There is a very large variance in GNI compared to the other variables, so on the original scale GNI dominates the first principal component (it appears to explain essentially 100% of the variation). PCA is sensitive to the scales of the variables, so we need to standardize the data.
# standardize the human data
human_scaled <- scale(human_)
pca_human <- prcomp(human_scaled)
#print out the summary
s <- summary(pca_human)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 51.7 16.3 10.0 7.9 6.2 3.8 2.9 1.2
# New PCA with standardized data
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("darkblue", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
The biplot looks much better after scaling! The PC1 axis captures 51.7% of the variation: education, life expectancy and GNI (arrows to the left) correlate negatively with maternal mortality and adolescent birth rate (arrows to the right). The PC2 axis (16.3% of the variation) is driven by women’s labour force participation and representation in parliament.
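The directions in the biplot can also be read numerically from the loadings (the rotation matrix). A self-contained sketch on R’s built-in USArrests data, since human.csv is a local file:

```r
# PCA on standardized data: each column of the rotation matrix is one PC
pca <- prcomp(USArrests, scale. = TRUE)

# signs and magnitudes in PC1/PC2 show which variables drive each axis
round(pca$rotation[, 1:2], 2)

# share of variance captured by each component, as in the biplot axis labels
round(100 * summary(pca)$importance[2, ], 1)
```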
# Load the data into R
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
# Load needed libraries
library(FactoMineR)
library(tidyr)
library(ggplot2)
# Explore the data set
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
View(tea)
The tea data set has 300 observations and 36 variables. It contains data on tea consumption and drinking habits. The variable age is an integer, while all the others are factors.
# Visualize the data set
gather(tea) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7))
## Warning: attributes are not identical across measure variables; they will be
## dropped
Unfortunately I don’t know why this figure looks like this, and I’m not happy with it! The warning suggests that gather() coerced columns of different types together (the integer age mixed in with the factors), which may be part of the problem.
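One possible fix (a guess, since I could not rerun the chunk): the integer age column both triggers the attribute warning and crowds one panel with hundreds of distinct values, so keeping only the factor columns before reshaping should help. A sketch on a toy data frame of my own:

```r
library(dplyr)
library(tidyr)

# toy stand-in for the tea data: two factor columns plus an integer column
df <- data.frame(breakfast = factor(c("yes", "no", "yes")),
                 sugar     = factor(c("no", "no", "yes")),
                 age       = c(39L, 45L, 47L))

df_long <- df %>%
  dplyr::select(where(is.factor)) %>%           # drop the integer column
  mutate(across(everything(), as.character)) %>% # one shared type, no warning
  pivot_longer(cols = everything())

head(df_long)
```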
# column names to keep in the dataset
keep_columns <- c("Tea", "frequency", "sex", "work", "age_Q")
# select the 'keep_columns' to create a new dataset
# (all_of() avoids the tidyselect warning about external vectors)
tea_time <- dplyr::select(tea, all_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea frequency sex work age_Q
## black : 74 +2/day :127 F:178 Not.work:213 +60 :38
## Earl Grey:193 1 to 2/week: 44 M:122 work : 87 15-24:92
## green : 33 1/day : 95 25-34:69
## 3 to 6/week: 34 35-44:40
## 45-59:61
str(tea_time)
## 'data.frame': 300 obs. of 5 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ frequency: Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
# visualize the dataset
library(ggplot2)
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Based on the selected variables, 15-24-year-olds drink the most tea. Most respondents drink tea more than twice a day, with women drinking slightly more than men. Earl Grey is more popular than green or black tea. People are more than twice as likely to drink tea outside the workplace.

***
date()
## [1] "Mon Dec 11 19:44:51 2023"
# Load required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ purrr 1.0.2
## ✔ lubridate 1.9.3 ✔ stringr 1.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ MASS::select() masks dplyr::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
# Read the data into R
RATSL <- read.table("/Users/minmaunu/IODS-project/data/RATSL.txt", header = TRUE)
# Change the categorical variables to factors
RATSL$ID <- as.factor(RATSL$ID)
RATSL$Group <- as.factor(RATSL$Group)
# Take a glimpse on the data
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
The data has 176 rows and 5 columns. Everything seems to be in order and the data is ready for analysis.
Let’s draw a line plot of the rat weights in diet groups 1-3.
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
A high starting weight generally means a high end weight (tracking), so the data should be standardized at each time point.
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdweight = (Weight-mean(Weight))/sd(Weight)) %>%
ungroup()
ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$stdweight), max(RATSL$stdweight)), name = "standardized weight")
Looks better. Next let’s look at the response to the diets.
# number of distinct time points (used below as the denominator of the se;
# strictly, the se should use the number of rats per group instead)
n <- RATSL$Time %>% unique() %>% length()
RATS_trmt <- RATSL %>% group_by(Group, Time) %>% summarise(mean = mean(Weight), se = sd(Weight)/sqrt(n))
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
glimpse(RATS_trmt)
## Rows: 33
## Columns: 4
## Groups: Group [3]
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ se <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3.30…
ggplot(RATSL, aes(x= as.factor(Time), y=Weight, fill=Group)) +
geom_boxplot() +
theme(legend.position = c(0.8,0.4), panel.grid = element_blank(), panel.background = element_blank(), panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill=FALSE)) +
scale_y_continuous(name = "Weight") +
scale_x_discrete(name = "Time") +
scale_fill_manual(values=c("pink", "forestgreen", "lightblue")) +
ggtitle("Rat weight over time according to different diets ")
On average, diet 1 results in lower weights than diets 2 and 3. Diet 2 seems to have the largest variation among the rats.
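The claim about spread can also be checked numerically with a per-group summary. Since RATSL is a local file, here is the same kind of check on R’s built-in ChickWeight data as a stand-in:

```r
# per-diet mean and standard deviation of weight: a quick numeric check
# of group-wise spread (ChickWeight ships with base R)
aggregate(weight ~ Diet, data = ChickWeight,
          FUN = function(w) c(mean = mean(w), sd = sd(w)))
```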
Next let’s look at the outliers in the data.
RATSL8 <- RATSL %>%
group_by(Group, ID) %>%
summarize(mean = mean(Weight)) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(RATSL8, aes(x=Group, y=mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
theme(legend.position = c(0.8,0.8), panel.grid = element_blank(), panel.background = element_blank(), panel.border = element_rect(fill = FALSE)) +
scale_y_continuous(name = "mean(Weight) per group")
One outlier can be found in each group, but the one in group 2 clearly stands apart from the rest. Let’s remove it and draw the boxplot again!
RATSL8S1 <- filter(RATSL8, (Group=="2" & mean<500) | Group!="2")
ggplot(RATSL8S1, aes(x = Group, y = mean)) +
geom_boxplot() +
theme(legend.position = c(0.8,0.8), panel.grid = element_blank(), panel.background = element_blank(), panel.border = element_rect(fill = FALSE)) +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight) per group")
# Read the data
BPRSL <- read.table("/Users/minmaunu/IODS-project/data/BPRSL.txt", header = TRUE)
# Change the categorical variables to factors
BPRSL$treatment <- as.factor(BPRSL$treatment)
BPRSL$subject <- as.factor(BPRSL$subject)
# Take a glimpse on the data
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
The data has 360 rows and 5 columns. Everything seems to be in order and the data is ready for analysis.
ggplot(BPRSL, aes(x = week, y = bprs, fill=subject, color = treatment)) +
geom_line(aes(linetype = treatment)) +
theme(legend.position = c(0.8,0.8), panel.grid = element_blank(), panel.background = element_blank(),
panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill=FALSE)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2), expand=c(0,0)) +
scale_y_continuous(name = "bprs") +
scale_color_manual(values=c("pink", "forestgreen")) +
ggtitle("Symptoms in patients over time")
It’s quite a busy plot, but just by eyeballing it seems that with treatment 1 the symptoms have slightly decreased over time, whereas with treatment 2 there is more variation towards the end.
Next let’s fit a Random Intercept Model.
# Load required libraries
library(Matrix)
library(lme4)
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
Now let’s fit a random intercept and random slope model and compare it to the random intercept model.
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
anova(BPRS_ref, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
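The p-value of such a likelihood ratio test can also be extracted programmatically from the anova table. A sketch on lme4’s bundled sleepstudy data (assuming lme4 is installed), since BPRSL is a local file:

```r
library(lme4)

# nested models: random intercept vs random intercept + random slope
m0 <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = FALSE)
m1 <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy, REML = FALSE)

comp <- anova(m0, m1)        # likelihood ratio test between the nested models
comp$`Pr(>Chisq)`[2]         # p-value for the extra random-slope terms
```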
Let’s try interaction between week and treatment and compare again.
BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the chi-squared statistic and the p-value of the likelihood ratio test, the random intercept and random slope model (without the interaction) fits best. Let’s plot:
library(ggplot2)
library(dplyr)
library(tidyr)
library(gridExtra)
p1 <- ggplot(BPRSL, aes(x = week, y = bprs, fill = subject, color = treatment)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "Observed bprs") +
scale_color_manual(values=c("pink", "forestgreen")) +
ggtitle("Symptoms in patients over time") +
theme(panel.grid = element_blank(), panel.background = element_blank(),
panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill=FALSE), legend.position = "bottom",)
Fitted <- fitted(BPRS_ref1)
BPRSL <- mutate(BPRSL, fitted=Fitted)
p2 <- ggplot(BPRSL, aes(x = week, y = fitted, fill = subject, color = treatment)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "Fitted bprs") +
scale_color_manual(values=c("pink", "forestgreen")) +
ggtitle(" ") +
theme(panel.grid = element_blank(), panel.background = element_blank(),
panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill=FALSE), legend.position = "bottom")
# p1 and p2 side by side
grid.arrange(p1, p2, ncol=2)
Based on the fitted model, we can see that the two treatments are, after all, about equally effective in reducing the symptoms.
Done!